CDH 5.3 Hadoop cluster using VirtualBox and QuickStart VM II: Testing
Continued from CDH 5.3 Hadoop cluster using VirtualBox and QuickStart VM, in this chapter we'll test the QuickStart VM with a simple wordcount example. Let's start by creating an input file:
[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ mkdir temp
[cloudera@quickstart ~]$ ls
cloudera-manager  Desktop    eclipse  Pictures  Templates
cm_api.sh         Documents  lib      Public    Videos
datasets          Downloads  Music    temp      workspace
[cloudera@quickstart ~]$ cd temp
[cloudera@quickstart temp]$ ls
[cloudera@quickstart temp]$ echo "If you torture the data long enough, it will confess." > wordcount.txt
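As a quick sanity check (not part of the original session), we can print the file back and count its words; with the sentence above, wc should report 10 words:

# Verify the input file we just created
[cloudera@quickstart temp]$ cat wordcount.txt
If you torture the data long enough, it will confess.
[cloudera@quickstart temp]$ wc -w wordcount.txt
10 wordcount.txt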
Next, we need to find the VM's IP address so we can ssh into it. Issue the "ifconfig" command from the VM:
[cloudera@quickstart ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:7B:18:B1
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:302205 errors:0 dropped:0 overruns:0 frame:0
          TX packets:172416 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:374769173 (357.4 MiB)  TX bytes:21593675 (20.5 MiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:4E:45:86
          inet addr:192.168.56.101  Bcast:192.168.56.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5444 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:880947 (860.2 KiB)  TX bytes:207516 (202.6 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:35896200 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35896200 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19029196779 (17.7 GiB)  TX bytes:19029196779 (17.7 GiB)
Or "ip addr":
[cloudera@quickstart ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:7b:18:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:4e:45:86 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global eth1
We see "inet addr:192.168.56.101" in "eth1" part, and that's the ip to which we can ssh from Mac Terminal:
ip-192-168-1-48:.ssh kihyuckhong$ ssh cloudera@192.168.56.101
The authenticity of host '192.168.56.101 (192.168.56.101)' can't be established.
RSA key fingerprint is 86:23:13:67:60:55:b8:d2:11:89:c8:a2:e4:db:4c:b0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.56.101' (RSA) to the list of known hosts.
Connection closed by 192.168.56.101
ip-192-168-1-48:.ssh kihyuckhong$ ssh cloudera@192.168.56.101
cloudera@192.168.56.101's password:
[cloudera@quickstart ~]$
Now we're able to ssh into the CentOS guest where the Cloudera VM is installed!
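If you expect to ssh in often, key-based login saves typing the password each time. A minimal sketch, assuming the default key file names (if ssh-copy-id isn't available on the Mac, append ~/.ssh/id_rsa.pub to the VM's ~/.ssh/authorized_keys by hand):

# On the Mac: generate a key pair if you don't already have one
ip-192-168-1-48:~ kihyuckhong$ ssh-keygen -t rsa

# Copy the public key to the VM; subsequent logins should skip the password prompt
ip-192-168-1-48:~ kihyuckhong$ ssh-copy-id cloudera@192.168.56.101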
[cloudera@quickstart ~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
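To confirm which Hadoop build the VM ships, the "version" subcommand listed above is handy; on this CDH 5.3.0 image it should report the 2.5.0-cdh5.3.0 build, matching the jar names we'll see below:

# Print the Hadoop build information (first line shown; more details follow)
[cloudera@quickstart ~]$ hadoop version
Hadoop 2.5.0-cdh5.3.0
...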
We can quickly check what we have in HDFS:
[cloudera@quickstart ~]$ hadoop fs -ls /user
Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2015-03-24 20:27 /user/cloudera
drwxr-xr-x   - hdfs     supergroup          0 2015-03-14 20:11 /user/hdfs
drwxr-xr-x   - mapred   hadoop              0 2015-03-15 14:08 /user/history
drwxrwxrwx   - hive     hive                0 2014-12-18 04:33 /user/hive
drwxrwxr-x   - hue      hue                 0 2015-03-21 15:34 /user/hue
drwxrwxrwx   - oozie    oozie               0 2014-12-18 04:34 /user/oozie
drwxr-xr-x   - sample   sample              0 2015-03-14 22:05 /user/sample
drwxr-xr-x   - spark    spark               0 2014-12-18 04:34 /user/spark
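Note that with no path argument, -ls lists the current user's HDFS home directory, /user/cloudera in our case:

# Equivalent to "hadoop fs -ls /user/cloudera" when logged in as cloudera
[cloudera@quickstart ~]$ hadoop fs -ls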
Now let's put our input file into HDFS:
[cloudera@quickstart temp]$ pwd
/home/cloudera/temp
[cloudera@quickstart temp]$ ls
wordcount.txt
[cloudera@quickstart temp]$ hdfs dfs -mkdir /user/cloudera/input
[cloudera@quickstart temp]$ hdfs dfs -ls /user/cloudera/input
[cloudera@quickstart temp]$
[cloudera@quickstart temp]$ hdfs dfs -put /home/cloudera/temp/wordcount.txt /user/cloudera/input
[cloudera@quickstart temp]$ hdfs dfs -ls /user/cloudera/input
Found 1 items
-rw-r--r--   1 cloudera cloudera         54 2015-03-15 17:24 /user/cloudera/input/wordcount.txt
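To verify the upload round-tripped intact, we can cat the file straight out of HDFS; it should match the local original (54 bytes, as the listing shows):

# Print the file from HDFS; the content should match the local copy
[cloudera@quickstart temp]$ hdfs dfs -cat /user/cloudera/input/wordcount.txt
If you torture the data long enough, it will confess.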
We can also check the input file from the web UI.
Let's check the /usr/lib/hadoop-mapreduce/ directory:
[cloudera@quickstart temp]$ ls -ltr /usr/lib/hadoop-mapreduce/
...
lrwxrwxrwx 1 root root      44 Dec 18 04:25 hadoop-mapreduce-examples.jar -> hadoop-mapreduce-examples-2.5.0-cdh5.3.0.jar
Run the jar without arguments to see which MapReduce example programs are bundled in it:
[cloudera@quickstart temp]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
We can locate the program we need, the fourth one from the bottom:
wordcount: A map/reduce program that counts the words in the input files.
[cloudera@quickstart temp]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/input/wordcount.txt /user/cloudera/output
15/03/15 17:49:01 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
15/03/15 17:49:02 INFO input.FileInputFormat: Total input paths to process : 1
15/03/15 17:49:02 INFO mapreduce.JobSubmitter: number of splits:1
15/03/15 17:49:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1426453727985_0001
15/03/15 17:49:03 INFO impl.YarnClientImpl: Submitted application application_1426453727985_0001
15/03/15 17:49:03 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1426453727985_0001/
15/03/15 17:49:03 INFO mapreduce.Job: Running job: job_1426453727985_0001
15/03/15 17:49:21 INFO mapreduce.Job: Job job_1426453727985_0001 running in uber mode : false
15/03/15 17:49:21 INFO mapreduce.Job:  map 0% reduce 0%
15/03/15 17:49:37 INFO mapreduce.Job:  map 100% reduce 0%
15/03/15 17:49:48 INFO mapreduce.Job:  map 100% reduce 100%
15/03/15 17:49:48 INFO mapreduce.Job: Job job_1426453727985_0001 completed successfully
15/03/15 17:49:48 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=128
		FILE: Number of bytes written=217765
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=184
		HDFS: Number of bytes written=74
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=13334
		Total time spent by all reduces in occupied slots (ms)=5741
		Total time spent by all map tasks (ms)=13334
		Total time spent by all reduce tasks (ms)=5741
		Total vcore-seconds taken by all map tasks=13334
		Total vcore-seconds taken by all reduce tasks=5741
		Total megabyte-seconds taken by all map tasks=13654016
		Total megabyte-seconds taken by all reduce tasks=5878784
	Map-Reduce Framework
		Map input records=1
		Map output records=10
		Map output bytes=94
		Map output materialized bytes=124
		Input split bytes=130
		Combine input records=10
		Combine output records=10
		Reduce input groups=10
		Reduce shuffle bytes=124
		Reduce input records=10
		Reduce output records=10
		Spilled Records=20
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=737
		CPU time spent (ms)=730
		Physical memory (bytes) snapshot=389378048
		Virtual memory (bytes) snapshot=1715568640
		Total committed heap usage (bytes)=303366144
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=54
	File Output Format Counters
		Bytes Written=74
[cloudera@quickstart temp]$
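Besides the tracking URL in the log, we can query YARN from the command line; a sketch using the application id printed above:

# List applications known to the ResourceManager (including finished ones)
[cloudera@quickstart temp]$ yarn application -list -appStates ALL

# Or ask for the status and progress of this particular run
[cloudera@quickstart temp]$ yarn application -status application_1426453727985_0001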
Here are the output files:
[cloudera@quickstart temp]$ hdfs dfs -ls /user/cloudera/output
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2015-03-15 17:49 /user/cloudera/output/_SUCCESS
-rw-r--r--   1 cloudera cloudera         74 2015-03-15 17:49 /user/cloudera/output/part-r-00000
We can also check the output directory using the web UI.
Now, let's see what's in the output file, part-r-00000:
[cloudera@quickstart temp]$ hdfs dfs -cat /user/cloudera/output/part-r-00000
If	1
confess.	1
data	1
enough,	1
it	1
long	1
the	1
torture	1
will	1
you	1
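One caveat before experimenting further: MapReduce refuses to start if the output directory already exists, so remove it (or choose a new path) before re-running the job. We can also merge the output back to a local file first:

# Optionally merge the result down to a local file
[cloudera@quickstart temp]$ hdfs dfs -getmerge /user/cloudera/output wordcount_result.txt

# Then delete the output directory so the job can be run again
[cloudera@quickstart temp]$ hdfs dfs -rm -r /user/cloudera/output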